-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-26085][SQL] Key attribute of non-struct type under typed aggregation should be named as "key" too #23054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan |
|
makes sense to me. This is a behavior change right? Shall we write a migration guide? |
|
Test build #98891 has finished for PR 23054 at commit
|
|
Ok. Let me update migration guide. |
|
Test build #98902 has finished for PR 23054 at commit
|
|
retest this please |
|
Test build #98907 has finished for PR 23054 at commit
|
|
Test build #98961 has finished for PR 23054 at commit
|
|
We should add a “legacy” flag in case somebody’s workload gets broken by this. We can remove the legacy flag in a future release. |
|
Ok. I will add a flag. Thanks @rxin |
|
Test build #98977 has finished for PR 23054 at commit
|
|
retest this please. |
|
Test build #98978 has finished for PR 23054 at commit
|
|
retest this please. |
|
Test build #98976 has finished for PR 23054 at commit
|
|
Test build #98981 has finished for PR 23054 at commit
|
|
BTW what does the non-primitive types look like? Do they get flattened, or is there a struct ? |
|
For struct types there is a struct named "key". |
docs/sql-migration-guide-upgrade.md
Outdated
|
|
||
| - The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set. | ||
|
|
||
| - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is atomic type, e.g. int, string, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.atomicKeyAttributeGroupByKey` with a default value of `false`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that, only struct type key has the key alias. So here we should say: if the key is non-struct type, e.g. int, string, array, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. More accurate.
| .createWithDefault(false) | ||
|
|
||
| val LEGACY_ATOMIC_KEY_ATTRIBUTE_GROUP_BY_KEY = | ||
| buildConf("spark.sql.legacy.atomicKeyAttributeGroupByKey") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark.sql.legacy.dataset.aliasNonStructGroupingKey?
| val keyColumn = if (!kExprEnc.isSerializedAsStruct) { | ||
| assert(groupingAttributes.length == 1) | ||
| groupingAttributes.head | ||
| if (SQLConf.get.aliasNonStructGroupingKey) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should do the alias when config is true...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, don't we want to have "key" attribute and only have old "value" attribute when we turn on legacy config?
|
Test build #98988 has finished for PR 23054 at commit
|
| .createWithDefault(false) | ||
|
|
||
| val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY = | ||
| buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKey") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe aliasNonStructGroupingKeyAsValue, and default to false.
Then we can remove this in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. That makes sense. Thanks.
|
Test build #99009 has finished for PR 23054 at commit
|
|
Test build #99010 has finished for PR 23054 at commit
|
|
sorry it conflicts, can you resolve it? I think it's ready to go |
|
@cloud-fan Yea, it's resolved. Thanks. |
|
Test build #99035 has finished for PR 23054 at commit
|
|
Test build #99096 has finished for PR 23054 at commit
|
|
retest this please |
|
Test build #99097 has finished for PR 23054 at commit
|
|
retest this please... |
|
Test build #99103 has finished for PR 23054 at commit
|
|
hmmm it conflicts again... |
|
yea, resolved again. :) |
|
Test build #99146 has finished for PR 23054 at commit
|
|
thanks, merging to master! |
…gation should be named as "key" too ## What changes were proposed in this pull request? When doing typed aggregation on a Dataset, for struct key type, the key attribute is named as "key". But for non-struct type, the key attribute is named as "value". This key attribute should also be named as "key" for non-struct type. ## How was this patch tested? Added test. Closes apache#23054 from viirya/SPARK-26085. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
When doing typed aggregation on a Dataset, for struct key type, the key attribute is named as "key". But for non-struct type, the key attribute is named as "value". This key attribute should also be named as "key" for non-struct type.
How was this patch tested?
Added test.